The purpose of this project was to fit a series of regression models
to a dataset containing housing features, with sale price as the
response variable. Two models were constructed using
R and JAGS. The first JAGS
model used 3 features to predict sale prices, while the final iteration
used 4 features. A number of evaluation metrics, including the deviance
information criterion (DIC), were generated to gauge the accuracy of
each model. It was determined that incorporating the 4th feature reduced
the DIC and, hence, produced a better fit for the data.
The Ames Housing dataset, which is available on Kaggle.com, was compiled by Dean De Cock for use in data science education. It is a rich resource for data scientists and statisticians looking for a modernized and expanded alternative to the often-cited Boston Housing dataset.
The subject dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, along with the sale price of each home. For the purposes of this project, we will construct two models that leverage a subset (specifically, 3-4) of the most informative explanatory variables to assess their ability to accurately predict sale prices for homes given their features.
The features to be included are:
- LotArea - Lot size in square feet
- OverallCond - Rates the overall condition of the house (10=Excellent, 1=Very Poor)
- GrLivArea - Above grade (ground) living area in square feet

We will create an additional iteration of the model that incorporates
the feature TotRmsAbvGrd, the total number of rooms above
grade.
dat = read.csv(file="data_files/housing-price-data.csv", header=TRUE)
# Subset the raw data to features of interest and the response variable, SalePrice
dat1=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","SalePrice")]
# Code the categorical variable `HouseStyle`.
dat1$HouseStyle_coded = factor(dat1$HouseStyle)
newdat1 = dat1[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","SalePrice")]
head(newdat1)
## LotArea HouseStyle_coded OverallCond GrLivArea SalePrice
## 1 8450 2Story 5 1710 208500
## 2 9600 1Story 8 1262 181500
## 3 11250 2Story 5 1786 223500
## 4 9550 2Story 5 1717 140000
## 5 14260 2Story 5 2198 250000
## 6 14115 1.5Fin 5 1362 143000
Now, let’s look at the relationships among the features and their distributions:
library(Hmisc)
pairs(newdat1)
It looks like most, if not all, of the numeric features have a
somewhat linear relationship with SalePrice.
hist.data.frame(newdat1)
It also appears that all of our non-categorical variables (including
the response variable, SalePrice) have a somewhat normal
distribution.
Since each of the features appears to have a linear relationship with
SalePrice, let’s start off by postulating a hierarchical
JAGS linear model of the descriptor features. We
will model with non-informative prior distributions (i.e., large \(\sigma^2\)) for the model
coefficients. The model will take the form of
\(y_i \sim N(\mu_i,\sigma^2), \quad \mu_i = \beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki}, \quad \beta_j \sim N(0, 1e6),\)
where \(k\) is the number of descriptor variables in the data set and \(i\) indexes the observations.
Equivalently, \(y_i\,|\,x_i,\beta,\sigma^2 \stackrel{ind}{\sim}N(\beta_0+\beta_1 x_{1i}+\dots+\beta_k x_{ki},\,\sigma^2)\),
where the non-informative prior for \(\sigma^2\) is modeled using an \(InverseGamma(\alpha,\beta)\) distribution.
library("rjags")
newdat1 = na.omit(newdat1)
mod1_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
+b[4]*HouseStyle_coded1Story[i]+b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:9) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
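As an aside, the dgamma prior placed on the precision `prec` is equivalent to the inverse-gamma prior on \(\sigma^2\) described earlier. A standalone Monte Carlo sketch (not part of the analysis) illustrates the correspondence:

```r
# The Gamma(shape = 5/2, rate = 5*10/2 = 25) prior on the precision
# implies sig2 = 1/prec ~ InverseGamma(5/2, 25), whose theoretical mean
# is rate / (shape - 1) = 25 / 1.5, i.e. about 16.7.
set.seed(1)
prec_draws = rgamma(1e5, shape = 5/2.0, rate = 5*10.0/2.0)
sig2_draws = 1 / prec_draws
mean(sig2_draws)  # close to 25/1.5
```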
data_jags = list(y=newdat1$SalePrice,LotArea=newdat1$LotArea,OverallCond=newdat1$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat1$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat1$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat1$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat1$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat1$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat1$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat1$HouseStyle_coded=="SLvl"),n=nrow(newdat1))
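The hand-built 0/1 indicators in `data_jags` could equivalently be produced with `model.matrix()`. A standalone toy sketch of that pattern (using a small hypothetical factor, not the housing data):

```r
# model.matrix() expands a factor into one 0/1 indicator column per
# non-reference level, matching the repeated as.numeric(...) calls above.
df = data.frame(HouseStyle_coded = factor(c("1Story", "2Story", "1Story", "SLvl")))
X_dummies = model.matrix(~ HouseStyle_coded, data = df)[, -1, drop = FALSE]  # drop intercept
colnames(X_dummies)                     # "HouseStyle_coded2Story" "HouseStyle_codedSLvl"
X_dummies[, "HouseStyle_coded2Story"]   # 0 1 0 0
```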
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(9,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod1_jags = jags.model(textConnection(mod1_jags_string), data=data_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 11
## Total graph size: 17039
##
## Initializing model
update(mod1_jags, 1000) # burn-in
mod1_jags_sim = coda.samples(model=mod1_jags,
variable.names=params1,
n.iter=5000)
mod1_jags_csim = do.call(rbind, mod1_jags_sim) # combine multiple chains
# gelman.diag(mod1_jags_sim)
# autocorr.diag(mod1_jags_sim)
# autocorr.plot(mod1_jags_sim)
# effectiveSize(mod1_jags_sim)
plot(mod1_jags_sim)
We can observe that the above traces for the \(b[i]\)s converge and the posterior densities are all approximately normal. With convergence across all three chains, we can be reasonably confident that the sampler is exploring the posterior properly. We can now get a posterior summary of the parameters in our model.
summary(mod1_jags_sim)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 4.516 0.218 0.00178 0.003037
## b[2] 18723.683 543.422 4.43702 8.117981
## b[3] -69.654 1001.964 8.18101 8.138457
## b[4] 1846.208 977.089 7.97790 8.153038
## b[5] 30.843 997.046 8.14085 8.084877
## b[6] -16.013 999.438 8.16038 8.160779
## b[7] 3103.923 982.738 8.02402 8.470398
## b[8] -59.084 996.006 8.13236 8.132493
## b[9] 57.976 992.552 8.10416 8.104302
## b0 4390.004 987.233 8.06072 9.364645
## sig 87117.009 1735.543 14.17065 17.676296
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 4.092 4.37 4.514 4.665 4.942
## b[2] 17642.480 18360.56 18726.433 19086.084 19777.445
## b[3] -1995.871 -748.86 -61.569 601.272 1917.174
## b[4] -45.896 1192.36 1840.503 2502.362 3770.414
## b[5] -1925.142 -645.30 32.594 700.720 1985.163
## b[6] -1986.272 -699.86 -15.171 665.299 1926.755
## b[7] 1192.001 2433.99 3103.429 3767.202 5046.794
## b[8] -2021.505 -736.43 -61.910 613.709 1894.689
## b[9] -1895.159 -598.33 54.210 712.913 1978.816
## b0 2451.783 3732.17 4376.803 5052.212 6329.450
## sig 83794.508 85933.80 87096.221 88261.711 90668.208
We notice that the standard deviation is quite high for the
coefficients associated with the coded categorical variable
HouseStyle_coded. This makes sense, since the \(x_i\) for these indicator features can only take on
values of \(0\) or \(1\).
In a Bayesian model, we have distributions for residuals, but we’ll simplify and look only at the residuals evaluated at the posterior mean of the parameters.
X = cbind(rep(1.0, data_jags$n), data_jags$LotArea,data_jags$OverallCond,data_jags$HouseStyle_coded1.5Unf,
data_jags$HouseStyle_coded1Story,data_jags$HouseStyle_coded2.5Fin,data_jags$HouseStyle_coded2.5Unf,
data_jags$HouseStyle_coded2Story,data_jags$HouseStyle_codedSFoyer,data_jags$HouseStyle_codedSLvl)
head(X)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 8450 5 0 0 0 0 1 0 0
## [2,] 1 9600 8 0 1 0 0 0 0 0
## [3,] 1 11250 5 0 0 0 0 1 0 0
## [4,] 1 9550 5 0 0 0 0 1 0 0
## [5,] 1 14260 5 0 0 0 0 1 0 0
## [6,] 1 14115 5 0 0 0 0 0 0 0
(pm_params = colMeans(mod1_jags_csim)) # posterior mean
## b[1] b[2] b[3] b[4] b[5] b[6]
## 4.515934 18723.683170 -69.654493 1846.207866 30.842949 -16.012544
## b[7] b[8] b[9] b0 sig
## 3103.923261 -59.084345 57.976368 4390.004102 87117.009040
Create a vector of fitted values, \(\hat{y}\), generated by the model and examine the residuals.
yhat = drop(X %*% pm_params[1:10])
resid = data_jags$y - yhat
plot(resid) # residuals against data index
plot(yhat, resid) # residuals against predicted values
qqnorm(resid) # checking normality of residuals
We see that the residuals are mostly concentrated about zero; however, there are a few outliers. This suggests that other features could be added to the model to reduce its prediction error.
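To complement the plots, the residual spread can be summarized with a single number; a sketch, assuming `resid` from the chunk above:

```r
# Root-mean-square error of the fit evaluated at the posterior mean;
# smaller values indicate tighter predictions.
sqrt(mean(resid^2))
```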
As mentioned previously, we will build another linear model that
incorporates the numeric feature TotRmsAbvGrd, which is the
total rooms above grade (does not include bathrooms).
#Adjust the dataset to include `TotRmsAbvGrd`.
dat2=dat[,c("LotArea","HouseStyle","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
dat2$HouseStyle_coded = factor(dat2$HouseStyle)
newdat2 = dat2[,c("LotArea","HouseStyle_coded","OverallCond","GrLivArea","TotRmsAbvGrd","SalePrice")]
# head(newdat2)
The JAGS model becomes:
mod2_jags_string = " model {
for (i in 1:n) {
y[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*LotArea[i] + b[2]*OverallCond[i]+ b[3]*HouseStyle_coded1.5Unf[i]
+ b[4]*HouseStyle_coded1Story[i]+ b[5]*HouseStyle_coded2.5Fin[i]+ b[6]*HouseStyle_coded2.5Unf[i]
+ b[7]*HouseStyle_coded2Story[i]+ b[8]*HouseStyle_codedSFoyer[i]+ b[9]*HouseStyle_codedSLvl[i]
+ b[10]*TotRmsAbvGrd[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:10) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(5/2.0, 5*10.0/2.0)
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
data2_jags = list(y=newdat2$SalePrice, LotArea=newdat2$LotArea,OverallCond=newdat2$OverallCond,
HouseStyle_coded1.5Unf=as.numeric(newdat2$HouseStyle_coded=="1.5Unf"),HouseStyle_coded1Story=as.numeric(newdat2$HouseStyle_coded=="1Story"),
HouseStyle_coded2.5Fin=as.numeric(newdat2$HouseStyle_coded=="2.5Fin"),HouseStyle_coded2.5Unf=as.numeric(newdat2$HouseStyle_coded=="2.5Unf"),
HouseStyle_coded2Story=as.numeric(newdat2$HouseStyle_coded=="2Story"),HouseStyle_codedSFoyer=as.numeric(newdat2$HouseStyle_coded=="SFoyer"),
HouseStyle_codedSLvl=as.numeric(newdat2$HouseStyle_coded=="SLvl"),TotRmsAbvGrd=newdat2$TotRmsAbvGrd,
n=nrow(newdat2))
params1 = c("b0","b", "sig")
inits1 = function() {
inits = list("b0"=rnorm(1,0.0,100.0), "b"=rnorm(10,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
mod2_jags = jags.model(textConnection(mod2_jags_string), data=data2_jags, inits=inits1, n.chains=3)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1460
## Unobserved stochastic nodes: 12
## Total graph size: 18584
##
## Initializing model
update(mod2_jags, 1000) # burn-in
mod2_jags_sim = coda.samples(model=mod2_jags,
variable.names=params1,
n.iter=5000)
mod2_jags_csim = do.call(rbind, mod2_jags_sim) # combine multiple chains
# gelman.diag(mod2_jags_sim)
# autocorr.diag(mod2_jags_sim)
# autocorr.plot(mod2_jags_sim)
# effectiveSize(mod2_jags_sim)
plot(mod2_jags_sim)
summary(mod2_jags_sim)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## b[1] 2.188 0.1776 0.00145 0.002544
## b[2] 6339.411 639.7799 5.22378 14.803242
## b[3] -79.366 998.2203 8.15043 8.150603
## b[4] 1825.850 958.1720 7.82344 8.176910
## b[5] -84.747 1001.2881 8.17548 8.050979
## b[6] -121.467 997.8990 8.14781 8.253272
## b[7] 1653.969 964.3769 7.87410 8.029389
## b[8] -43.810 1000.8695 8.17206 8.111534
## b[9] -57.073 983.0139 8.02627 8.041809
## b[10] 17404.354 586.7732 4.79098 14.002838
## b0 1848.166 987.4110 8.06218 9.582944
## sig 68006.059 1310.8944 10.70341 13.345699
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## b[1] 1.843 2.068 2.186 2.307 2.543
## b[2] 5095.853 5898.568 6329.896 6775.045 7595.199
## b[3] -2022.736 -751.697 -89.833 600.097 1888.300
## b[4] -13.195 1171.850 1832.692 2465.480 3720.188
## b[5] -2027.863 -753.905 -84.041 590.717 1876.619
## b[6] -2080.128 -790.109 -119.393 554.646 1816.294
## b[7] -263.788 1005.980 1660.790 2301.394 3535.793
## b[8] -1994.089 -720.581 -43.400 638.966 1884.774
## b[9] -1985.461 -714.932 -61.421 615.326 1852.118
## b[10] 16262.382 17009.180 17405.388 17803.185 18540.086
## b0 -68.407 1184.027 1831.996 2513.628 3791.564
## sig 65463.421 67105.530 67979.246 68876.941 70658.796
The model coefficient for TotRmsAbvGrd, b[10], is quite
large compared with the other coefficients, suggesting that this
feature is a strong driver of SalePrice (keeping in mind
that raw coefficients depend on each feature's scale). As with the
previous model, the trace plots show convergence for all the
coefficients, and the densities are approximately normal.
Let’s check the residuals for this model.
X = cbind(rep(1.0, data2_jags$n), data2_jags$LotArea,data2_jags$OverallCond,data2_jags$HouseStyle_coded1.5Unf,
data2_jags$HouseStyle_coded1Story,data2_jags$HouseStyle_coded2.5Fin,data2_jags$HouseStyle_coded2.5Unf,
data2_jags$HouseStyle_coded2Story,data2_jags$HouseStyle_codedSFoyer,data2_jags$HouseStyle_codedSLvl,data2_jags$TotRmsAbvGrd)
head(X)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 1 8450 5 0 0 0 0 1 0 0 8
## [2,] 1 9600 8 0 1 0 0 0 0 0 6
## [3,] 1 11250 5 0 0 0 0 1 0 0 6
## [4,] 1 9550 5 0 0 0 0 1 0 0 7
## [5,] 1 14260 5 0 0 0 0 1 0 0 9
## [6,] 1 14115 5 0 0 0 0 0 0 0 5
(pm_params = colMeans(mod2_jags_csim)) # posterior mean
## b[1] b[2] b[3] b[4] b[5] b[6]
## 2.187958 6339.411218 -79.365965 1825.849805 -84.746797 -121.467282
## b[7] b[8] b[9] b[10] b0 sig
## 1653.969007 -43.809989 -57.072815 17404.354319 1848.166374 68006.059051
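Since raw coefficients are scale-dependent, a fairer way to compare drivers is to scale each posterior-mean coefficient by its feature's standard deviation; a sketch, assuming `newdat2` and `pm_params` from above:

```r
# Approximate effect of a one-SD change in each numeric feature on SalePrice.
pm_params["b[1]"]  * sd(newdat2$LotArea)       # LotArea
pm_params["b[2]"]  * sd(newdat2$OverallCond)   # OverallCond
pm_params["b[10]"] * sd(newdat2$TotRmsAbvGrd)  # TotRmsAbvGrd
```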
Create a vector of fitted values, \(\hat{y}\), generated by the model and examine the residuals.
yhat = drop(X %*% pm_params[1:11])
resid = data2_jags$y - yhat
plot(resid) # residuals against data index
plot(yhat, resid) # residuals against predicted values
qqnorm(resid) # checking normality of residuals
We see that these plots are very similar to those generated by the first
model; however, the magnitude of the residual outliers for the second
model is much smaller than for the first. We can
conclude that adding the additional feature has indeed improved the
model’s predictive capability.
Let’s compare the two models using the DIC metric:
dic.samples(mod1_jags, n.iter=1e3)
## Mean deviance: 37363
## penalty 3.182
## Penalized deviance: 37366
dic.samples(mod2_jags, n.iter=1e3)
## Mean deviance: 36640
## penalty 3.599
## Penalized deviance: 36643
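rjags also provides `diffdic()` for a direct comparison of the two penalized deviances; a sketch, assuming the fitted models from above (re-running `dic.samples` draws fresh samples, so values will vary slightly):

```r
# Difference in penalized deviance (model 1 minus model 2); positive
# values favor the second model.
dic1 = dic.samples(mod1_jags, n.iter = 1e3)
dic2 = dic.samples(mod2_jags, n.iter = 1e3)
diffdic(dic1, dic2)
```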
Upon examination of the deviance information criterion (DIC), the
penalized deviance of the second model (which incorporates
TotRmsAbvGrd) is lower than that of the first model. We conclude
that the second model, which incorporates the additional feature, is a
better fit (and predictor) for the data. Further support for this
conclusion is found in the relatively large magnitude of the b[10]
coefficient.